In our previous lesson, we created a PBMC object, completed clustering, and performed annotation. This serves as the foundation for all our subsequent analyses. In this lesson, we’ll systematically explore various visualization techniques using SeuratExtend to create high-quality, publication-ready figures.
Let’s start by loading the necessary packages and our previously saved PBMC object:
library(Seurat)
library(SeuratExtend)
# Load the PBMC object we saved at the end of Lesson 2
pbmc <- readRDS("rds/pbmc_annotated.rds")
In Seurat, dimension reduction plots such as UMAP are typically
created using DimPlot for discrete variables and
FeaturePlot for continuous variables.
SeuratExtend simplifies this process with
DimPlot2, which does not require differentiation between
variable types. This function automatically recognizes the type of input
parameters, whether discrete or continuous. Additionally,
DimPlot2 introduces numerous extra parameters to enrich the
customization of the plots.
To generate a basic dimension reduction plot, simply call
DimPlot2 with your Seurat object:
DimPlot2(pbmc)
DimPlot2 can handle both discrete and continuous
variables seamlessly. Here’s how to input different variables into the
plot:
DimPlot2(pbmc, features = c("cluster", "MS4A1", "CD14", "CD3D"))
You can also split the visualization by a specific variable, which is particularly useful for comparative analysis across conditions or identities. In our current PBMC object, we only have one sample. Let’s create a new column in the metadata to simulate a scenario with two samples for demonstration purposes:
set.seed(42) # For reproducibility
pbmc$sample <- sample(c("sample1", "sample1", "sample2"), size = ncol(pbmc), replace = TRUE)
table(pbmc$sample) # Check the distribution
##
## sample1 sample2
## 1796 904
Now, let’s create a split plot:
DimPlot2(pbmc, features = c("cluster", "CD14"), split.by = "sample", ncol = 1)
To highlight cells of interest, such as a specific cluster, you can define the cells explicitly and use them in your plot:
b_cells <- colnames(pbmc)[pbmc$cluster == "B cell"]
DimPlot2(pbmc, cells.highlight = b_cells)
For each variable, you can specify custom colors, adjust themes, and more. For detailed information on color customization, refer to the Explore Color Functions section:
DimPlot2(
pbmc,
features = c("cluster", "sample", "CD14", "CD3D"),
cols = list(
"cluster" = "light",
"CD14" = "D",
"CD3D" = c("#EEEEEE", "black")
),
theme = NoAxes())
To further enhance the plot, you can add labels and bounding boxes to clearly delineate different groups or points of interest:
DimPlot2(pbmc, label = TRUE, box = TRUE, label.color = "black", repel = TRUE, theme = NoLegend())
Sometimes, cluster names are too lengthy and can make the plot appear cluttered when displayed with labels. To address this, consider using indices to replace the cluster names in the plot, which helps make the visualization cleaner. For instance, you can label clusters as ‘C1’, ‘C2’, etc., on the plot itself, while detailing what each index stands for (e.g., ‘C1: B cell’, ‘C2: CD4 T Memory’) in the figure legend:
DimPlot2(pbmc, index.title = "C", box = TRUE, label.color = "black")
This approach ensures that the plot remains legible and aesthetically pleasing, even when dealing with numerous or complex labels.
In SeuratExtend, a unique visualization method allows
for the simultaneous display of three features on the same dimension
reduction plot. The functions FeaturePlot3 and
FeaturePlot3.grid employ a color mixing system (either RYB
or RGB) to represent three different genes (or other continuous
variables). This method uses the principles of color mixing to
quantitatively display the expression levels or intensities of these
three features in each cell.
In the RGB system, black represents no or low expression, and
brighter colors indicate higher levels:
In the RYB system, white represents no expression, and deeper colors
indicate higher expression levels:
Here’s how to display three markers using the RYB system, with red for CD3D, yellow for CD14, and blue for CD79A:
FeaturePlot3(pbmc, color = "ryb", feature.1 = "CD3D", feature.2 = "CD14", feature.3 = "CD79A", pt.size = 0.5)
For the RGB system, with red for CD3D, green for CD14, and blue for CD79A:
FeaturePlot3(pbmc, color = "rgb", feature.1 = "CD3D", feature.2 = "CD14", feature.3 = "CD79A", pt.size = 1)
FeaturePlot3.gridFeaturePlot3.grid extends FeaturePlot3 by
allowing multiple plots to be generated in one go. The
features parameter requires a vector where every three
values are assigned a color (RYB or RGB) and placed together in one
plot. If you wish to skip a color, use NA as a
placeholder.
For instance, to place the following five genes into two plots using the RYB system, and skip yellow in the second plot:
FeaturePlot3.grid(pbmc, features = c("CD3D", "CD14", "CD79A", "FCGR3A", NA, "LYZ"), pt.size = 0.5)
The background is usually white, so the choice of color system and
point size can significantly affect visual perception. In the RYB
system, where higher expression results in darker colors, a smaller
pt.size is preferable to prevent overlapping points. In
contrast, in the RGB system, higher expressions result in lighter
colors, potentially leading to visibility issues for highly expressed
cells that may blend into the white background. Here, a larger
pt.size is recommended so that the darker, low-expression
points can form a “background” to highlight the lighter, high-expression
points.
The ClusterDistrBar function is designed to visualize
the distribution of clusters across different samples. It can show both
absolute counts and proportions, and it allows for various
customizations including axis reversal and normalization.
To create a basic bar plot showing the distribution of clusters within samples, simply specify the origin (sample identifier) and cluster variables from your dataset:
ClusterDistrBar(origin = pbmc$sample, cluster = pbmc$cluster)
If you prefer to visualize the absolute cell count rather than
proportions, set the percent parameter to
FALSE:
ClusterDistrBar(origin = pbmc$sample, cluster = pbmc$cluster, percent = FALSE)
If a vertical orientation is preferred over the default horizontal
bars, set the flip parameter to FALSE:
ClusterDistrBar(origin = pbmc$sample, cluster = pbmc$cluster, flip = FALSE, reverse_order = TRUE)
If you prefer not to stack the bars, which can be useful for direct
comparisons of cluster sizes across samples, set the stack
parameter to FALSE:
ClusterDistrBar(origin = pbmc$sample, cluster = pbmc$cluster, flip = FALSE, stack = FALSE)
In cases where a visual plot is not required and only the underlying
data matrix is needed, set the plot parameter to
FALSE:
data_matrix <- ClusterDistrBar(origin = pbmc$sample, cluster = pbmc$cluster, plot = FALSE)
# View the matrix
print(data_matrix)
## sample1 sample2
## B cell 12.973274 12.8318584
## CD4 T Memory 18.040089 19.1371681
## CD4 T Naive 25.222717 23.7831858
## CD8 T cell 12.472160 11.8362832
## DC 1.113586 1.7699115
## Mono CD14 18.095768 18.4734513
## Mono FCGR3A 5.679287 6.3053097
## NK cell 6.013363 5.0884956
## Platelet 0.389755 0.7743363
While DimPlot uses color gradients to represent expression levels,
violin plots offer a more quantitative view of expression distribution.
The enhanced VlnPlot2 function in SeuratExtend integrates
functionalities to superimpose boxplots, easily add statistical
annotations, and offers greater flexibility in plot presentation
compared to the original VlnPlot in Seurat.
The method to employ VlnPlot2 will differ depending on
whether you’re using a Seurat object or a matrix. In this lesson, we’ll
focus on using a Seurat object.
Let’s start by selecting the genes we want to analyze. Here’s an example using three genes:
genes <- c("CD3D", "CD14", "CD79A")
VlnPlot2(pbmc, features = genes, ncol = 1)
VlnPlot2(pbmc, features = genes, pt = FALSE, hide.outlier = TRUE, ncol = 1)
VlnPlot2(pbmc, features = genes, group.by = "cluster", split.by = "sample")
cells <- colnames(pbmc)[pbmc$cluster %in% c("B cell", "Mono CD14", "CD8 T cell")]
VlnPlot2(pbmc, features = genes, group.by = "cluster", cells = cells)
VlnPlot2(pbmc, features = genes, group.by = "cluster", cells = cells,
stat.method = "wilcox.test", hide.ns = TRUE)
VlnPlot2(pbmc, features = genes, group.by = "cluster", cells = cells,
stat.method = "t.test", comparisons = list(c(1,2), c(1,3)), hide.ns = FALSE)
The Heatmap function provides a flexible and
comprehensive way to visualize matrices, especially those produced by
the CalcStats function.
First, let’s generate a sample matrix using the
CalcStats function:
# Assuming pbmc data and VariableFeatures function are available
genes <- VariableFeatures(pbmc)
toplot <- CalcStats(pbmc, features = genes, method = "zscore", order = "p", n = 4)
head(toplot, 10)
## B cell CD4 T Memory CD4 T Naive CD8 T cell DC Mono CD14
## CD79A 2.6617833 -0.3210925 -0.3572368 -0.3804323 -0.17865513 -0.3571722
## CD79B 2.4991391 -0.4657078 -0.5071176 -0.4044852 -0.42079797 -0.5171697
## MS4A1 2.6640982 -0.3491962 -0.3920821 -0.3623228 -0.26088265 -0.3756385
## TCL1A 2.6529328 -0.3658656 -0.3601593 -0.3457438 -0.07058998 -0.3691003
## LTB 1.3010444 1.4217753 0.9983871 0.1222852 -0.68256844 -0.9210437
## CD2 -0.8165695 1.5900673 0.7584343 1.3917165 -0.39990926 -0.8386137
## GIMAP7 -1.3653377 1.2997189 0.6523547 0.8297530 -0.46845278 -0.3241219
## AQP3 -0.6067480 2.1996497 0.9870714 0.2866108 -0.59156893 -0.7475342
## MALAT1 0.5479293 0.5091795 0.8464071 0.8334376 -0.46049421 -0.5621087
## MYC 0.6918531 1.2562329 1.6091033 0.3373354 -0.57521155 -0.7662308
## Mono FCGR3A NK cell Platelet
## CD79A -0.35758943 -0.3419031 -0.3677019
## CD79B 0.56448664 -0.1825349 -0.5658126
## MS4A1 -0.31078578 -0.3372859 -0.2759042
## TCL1A -0.34777379 -0.3677649 -0.4259352
## LTB -0.31881117 -0.8799306 -1.0411381
## CD2 -0.71405995 -0.0121216 -0.9589440
## GIMAP7 -0.11671624 0.9434342 -1.4506321
## AQP3 -0.68100419 -0.4787048 -0.3677718
## MALAT1 -0.05320274 0.6045094 -2.2656573
## MYC -0.70390444 -0.6826998 -1.1664781
Now, we can produce a basic heatmap:
Heatmap(toplot, lab_fill = "zscore")
Note that this provides a convenient method to display marker genes (or differentially expressed genes, DEGs) for each cluster, which is very practical. However, this is not the only method. Let’s also learn about Seurat’s built-in method for finding DEGs.
# Find all markers of B cells
bcell.markers <- FindMarkers(pbmc, ident.1 = "B cell", logfc.threshold = 1, only.pos = TRUE)
head(bcell.markers)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## CD79A 0.000000e+00 6.783307 0.934 0.043 0.000000e+00
## MS4A1 0.000000e+00 5.737084 0.860 0.052 0.000000e+00
## LINC00926 2.387605e-276 7.258980 0.562 0.009 3.971542e-272
## CD79B 2.983575e-275 4.595655 0.911 0.143 4.962879e-271
## TCL1A 8.485045e-274 6.612804 0.619 0.022 1.411402e-269
## HLA-DQA1 4.817482e-269 3.984138 0.885 0.118 8.013400e-265
This compares all B cells to non-B cells to find DEGs. If you want to
compare B cells with specific clusters, you can set the
ident.2 parameter.
We can also use FindAllMarkers to conveniently generate
marker genes for all clusters:
pbmc.markers <- FindAllMarkers(pbmc, logfc.threshold = 2, only.pos = TRUE)
## Calculating cluster B cell
## Calculating cluster CD4 T Memory
## Calculating cluster CD4 T Naive
## Calculating cluster CD8 T cell
## Calculating cluster DC
## Calculating cluster Mono CD14
## Calculating cluster Mono FCGR3A
## Calculating cluster NK cell
## Calculating cluster Platelet
head(pbmc.markers, 10)
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## CD79A 0.000000e+00 6.783307 0.934 0.043 0.000000e+00 B cell CD79A
## MS4A1 0.000000e+00 5.737084 0.860 0.052 0.000000e+00 B cell MS4A1
## LINC00926 2.387605e-276 7.258980 0.562 0.009 3.971542e-272 B cell LINC00926
## CD79B 2.983575e-275 4.595655 0.911 0.143 4.962879e-271 B cell CD79B
## TCL1A 8.485045e-274 6.612804 0.619 0.022 1.411402e-269 B cell TCL1A
## HLA-DQA1 4.817482e-269 3.984138 0.885 0.118 8.013400e-265 B cell HLA-DQA1
## VPREB3 6.983605e-238 6.957038 0.484 0.007 1.161653e-233 B cell VPREB3
## HLA-DQB1 3.947592e-234 4.054111 0.862 0.146 6.566424e-230 B cell HLA-DQB1
## CD74 1.691378e-188 3.004657 1.000 0.818 2.813439e-184 B cell CD74
## HLA-DRA 1.834090e-187 2.864834 1.000 0.490 3.050826e-183 B cell HLA-DRA
Sometimes, the first name on the x-axis might be too long and exceed
the left boundary of the plot. To prevent this issue and ensure all
labels are fully visible, you can increase the space on the left side of
the plot by adjusting the plot.margin parameter:
Heatmap(toplot, lab_fill = "zscore", plot.margin = margin(l = 30))
For denser matrices, you may wish to only show a subset of gene names:
genes <- VariableFeatures(pbmc)[1:500]
toplot2 <- CalcStats(pbmc, features = genes, method = "zscore", order = "p")
Heatmap(toplot2, lab_fill = "zscore", feature_text_subset = genes[1:20], expand_limits_x = c(-0.5, 12))
You can also split the heatmap based on gene groups:
gene_groups <- sample(c("group1", "group2", "group3"), nrow(toplot2), replace = TRUE)
Heatmap(toplot2, lab_fill = "zscore", facet_row = gene_groups) +
theme(axis.text.y = element_blank())
Waterfall plots are powerful visualization tools that can display differences between two conditions, showing gene expression, gene set enrichment, or other metrics. This function can handle inputs directly from Seurat objects or pre-processed matrices.
You can use the waterfall plot to compare expression levels of genes directly from a Seurat object, using LogFC to determine the bar length. Here’s how to do it for the top 80 variable features:
genes <- VariableFeatures(pbmc)[1:80]
WaterfallPlot(
pbmc, group.by = "cluster", features = genes,
ident.1 = "Mono CD14", ident.2 = "CD8 T cell", length = "logFC")
To further hone in on the most differentially expressed genes, you might want to keep only the top and bottom 20 genes. This can highlight the most critical differences between the two cell types:
WaterfallPlot(
pbmc, group.by = "cluster", features = genes,
ident.1 = "Mono CD14", ident.2 = "CD8 T cell", length = "logFC",
top.n = 20)
Additionally, you can set parameters to modify the plot, such as
filtering features based on a specific threshold, rotating the plot 90
degrees (by setting flip), and more. For detailed options,
refer to the function’s help documentation.
When choosing between heatmaps, violin plots, and waterfall plots, consider the following:
SeuratExtend provides various color themes to choose from. You can
adjust color-related parameters (such as cols or
col_theme) in visualization functions like
DimPlot2, VlnPlot2, Heatmap, and
WaterfallPlot. For more detailed information, refer to the
online tutorial: https://huayc09.github.io/SeuratExtend/articles/Visualization.html#explore-color-functions
Here are some recommended color schemes for visualization:
For discrete variables, SeuratExtend’s “light” color scheme:
library(cowplot)
plot_grid(
DimPlot2(pbmc, cols = "light", theme = NoAxes() + NoLegend()),
ClusterDistrBar(pbmc$sample, pbmc$cluster, cols = "light", flip = FALSE, border = "black", reverse_order = FALSE) +
theme(axis.title.x = element_blank())
)
For continuous variables, the “A” or “D” color schemes from viridis:
Heatmap(toplot, lab_fill = "zscore", color_scheme = "A")
WaterfallPlot(
pbmc, group.by = "cluster", features = genes,
ident.1 = "Mono CD14", ident.2 = "CD8 T cell", length = "logFC",
top.n = 20, color_theme = "D")
You can also use various color palettes from RColorBrewer (https://r-graph-gallery.com/38-rcolorbrewers-palettes.html):
library(RColorBrewer)
markers_genes <- c(
"MS4A1", "GNLY", "CD3E",
"CD8A", "CCR7", "CD14",
"FCER1A", "FCGR3A", "PPBP")
DimPlot2(
pbmc,
features = markers_genes,
cols = brewer.pal(9, "GnBu"),
theme = NoAxes(), pt.size = 0.3)
DimPlot2(
pbmc,
features = markers_genes,
cols = rev(brewer.pal(11,"Spectral")),
theme = NoAxes(), pt.size = 0.3)
SeuratExtend will soon be updated to include more
built-in color schemes for selection. Stay tuned for these exciting
additions!
While SeuratExtend provides many convenient functions
for visualization, all of these are built on top of
ggplot2. Understanding ggplot2 can give you
more flexibility in customizing your plots and creating entirely new
visualizations. In this section, we’ll cover the basics of
ggplot2 and how to use it with Seurat data.
ggplot2 is based on the Grammar of Graphics, a layered approach to creating visualizations. The basic components are:
Let’s start by creating the most basic UMAP plot using ggplot2:
# Extract data from the Seurat object
umap_data <- FetchData(pbmc, vars = c("umap_1", "umap_2", "cluster", "CD3D", "sample"))
head(umap_data)
## umap_1 umap_2 cluster CD3D sample
## AAACATACAACCAC-1 2.8657047 4.409265 CD4 T Naive 2.863463 sample1
## AAACATTGAGCTAC-1 6.4355502 -13.392225 B cell 0.000000 sample1
## AAACATTGATCAGC-1 0.5606911 1.904406 CD4 T Memory 3.489090 sample1
## AAACCGTGCTTCCG-1 -10.5961605 -3.310064 Mono FCGR3A 0.000000 sample1
## AAACCGTGTATGCG-1 8.9127020 3.276543 NK cell 0.000000 sample1
## AAACGCACTGGTAC-1 0.7217073 5.876705 CD4 T Naive 1.726522 sample1
# Create the most basic plot
ggplot(umap_data, aes(x = umap_1, y = umap_2, color = CD3D)) +
geom_point()
This is the most basic plot we can create. It shows the UMAP coordinates of our cells, colored by CD3D expression. However, it’s not very visually appealing or informative in its current state. Let’s improve it step by step:
# Create an improved plot
library(viridis)
ggplot(umap_data, aes(x = umap_1, y = umap_2, color = CD3D)) +
geom_point(size = 0.5, alpha = 0.7) +
scale_color_viridis() +
theme_minimal() +
labs(title = "UMAP colored by CD3D expression",
x = "UMAP_1", y = "UMAP_2")
Let’s break down the improvements we made:
size = 0.5, alpha = 0.7 in geom_point(). This
helps to visualize dense areas better.scale_color_viridis() to apply a
colorblind-friendly, perceptually uniform color scale.theme_minimal() to remove unnecessary plot
elements and give it a cleaner look.labs().These small changes significantly improve the readability and aesthetics of our plot. In the following sections, we’ll explore even more ways to customize and enhance our visualizations.
Now let’s customize our plot further:
ggplot(umap_data, aes(x = umap_1, y = umap_2, color = CD3D)) +
geom_point(size = 0.5, alpha = 0.7) +
scale_color_viridis(option = "A") +
facet_wrap(~sample, ncol = 3) +
theme_bw() +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
strip.background = element_rect(fill = "white"),
strip.text = element_text(face = "bold")) +
labs(title = "UMAP by cluster, colored by CD3D expression",
x = "UMAP_1", y = "UMAP_2")
In this version:
facet_wrap() to create separate plots for
each cluster.theme() to
remove axis text, ticks, and grid lines.Often, we want to combine multiple plots into a single figure. The
cowplot package is great for this:
library(cowplot)
# Create two plots
plot1 <- ggplot(umap_data, aes(x = umap_1, y = umap_2, color = cluster)) +
geom_point(size = 0.5, alpha = 0.7) +
scale_color_manual(values = color_pro(9,2)) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Clusters")
plot2 <- ggplot(umap_data, aes(x = umap_1, y = umap_2, color = CD3D)) +
geom_point(size = 0.5, alpha = 0.7) +
scale_color_viridis_c(option = "A") +
theme_minimal() +
labs(title = "CD3D Expression")
# Combine plots
combined_plot <- plot_grid(plot1, plot2, labels = c("A", "B"), ncol = 2)
# Add a title to the combined plot
title <- ggdraw() +
draw_label("UMAP Visualization of PBMC Data", fontface = "bold", x = 0, hjust = 0) +
theme(plot.margin = margin(0, 0, 0, 7))
plot_grid(title, combined_plot, ncol = 1, rel_heights = c(0.1, 1))
Finally, let’s save our plot using ggsave():
dir.create("results", showWarnings = FALSE)
ggsave("results/pbmc_umap_plot.png", combined_plot, width = 8, height = 4.5, dpi = 300)
This will save the plot as a high-resolution PNG file.
Remember, while ggplot2 offers extensive customization options, it can sometimes be complex. Don’t hesitate to consult the ggplot2 documentation, online resources, or AI assistants like Claude or ChatGPT for help with specific customizations.
This concludes our lesson on advanced visualization techniques. You now have a solid foundation in creating and customizing plots for your single-cell RNA-seq data analysis. In the next lesson, we’ll explore more advanced analytical techniques to further your understanding of single-cell data.